Lending Club Loans
Posted on Sun 23 September 2018 in Machine Learning
Lending Club Project¶
Lending Club is a marketplace for personal loans that matches borrowers who are seeking a loan with investors looking to lend money and make a return.
The data set contains approved loans from 2007 to 2011.
I'll predict whether a borrower will pay off their loan on time or not.
1) Data Cleaning¶
import pandas as pd
# Skip the first line of the raw file
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
# Remove all columns containing more than 50% missing values
half_count = len(loans_2007) / 2
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
# Remove useless columns
loans_2007 = loans_2007.drop(['desc', 'url'],axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
loans_2007 = pd.read_csv('loans_2007.csv')
print(loans_2007.head())
print(len(loans_2007.columns))
Remove useless columns and columns that leak information from the future¶
loans_2007 = loans_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv",
"grade", "sub_grade", "emp_title", "issue_d", "zip_code",
"out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv",
"total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries",
"collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis = 1)
print(len(loans_2007.columns))
We went from 52 columns down to 32 columns to build the model.
Explore the Different values in the Target column¶
print(loans_2007["loan_status"].value_counts())
Filter the target column and replace its categorical values with numeric labels¶
"Fully Paid" and "Charged Off" are the targets.
"Charged Off" means the borrower can't refund its loan. We can notice there is class imbalance between the two class we want to predict.
print(len(loans_2007))
# Keep only the rows where loan_status is "Fully Paid" or "Charged Off"
loans_2007 = loans_2007[(loans_2007["loan_status"] == "Fully Paid") | (loans_2007["loan_status"] == "Charged Off")]
print(len(loans_2007))
mapping_dict = {
"loan_status": {
"Fully Paid": 1,
"Charged Off": 0
}
}
loans_2007 = loans_2007.replace(mapping_dict)
loans_2007["loan_status"].head()
loans_2007.head()
Remove single-value columns¶
drop_columns = []
for col in loans_2007.columns:
    # unique() also counts the pandas missing value NaN as a value, so drop NaN values first
    length = len(loans_2007[col].dropna().unique())
    if length == 1:
        drop_columns.append(col)
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(len(drop_columns))
loans_2007.to_csv('filtered_loans_2007.csv', index=False)
We removed 8 columns that contained only a single unique value.
2) Preparing Features¶
Calculate number of Null values¶
loans = pd.read_csv('filtered_loans_2007.csv')
null_counts = loans.isnull().sum()
print(null_counts)
Handling Missing Values¶
We'll remove columns with more than 1% missing values, then drop the remaining rows that contain null values.
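As a quick check on which column crosses that 1% threshold, the counts above can be turned into fractions — a small sketch reusing null_counts:
# Fraction of missing values per column, largest first
print((null_counts / len(loans)).sort_values(ascending=False).head())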
loans = loans.drop(["pub_rec_bankruptcies"], axis = 1)
loans = loans.dropna(axis = 0)
print(loans.dtypes.value_counts())
# ReIndexing after removing missing values
loans = loans.reset_index(drop=True)
Explore text columns¶
object_columns_df = loans.select_dtypes(include=['object'])
print(object_columns_df.head(1))
--> Some columns seem to be categorical; we need to explore them by looking at their number of unique values.
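A quick way to count the distinct values in each text column — a small sketch using the object_columns_df defined above:
# Number of distinct values per text column, smallest first
print(object_columns_df.nunique().sort_values())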
Explore categorical columns¶
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for column in cols:
    print(loans[column].value_counts())
These 5 columns contain categorical values
Explore the 'purpose' and 'title' columns, which look similar¶
print(loans["purpose"].value_counts())
print(loans["title"].value_counts())
Convert features to Categorical columns¶
The home_ownership, verification_status, emp_length, and term columns each contain a few discrete categorical values. We'll use dummy variables for the columns containing categorical values; this splits a column into separate binary columns. For the int_rate and revol_util columns we need to strip the '%' sign and convert the values to floats.
Between the purpose and title columns we keep purpose because it contains fewer categorical values.
For the emp_length column we do some feature engineering because it contains ordered values.
Columns containing date values (earliest_cr_line, last_credit_pull_d) would require a good amount of feature engineering to be potentially useful, so we remove them.
Finally, addr_state contains too many discrete values.
mapping_dict = {
"emp_length": {
"10+ years": 10,
"9 years": 9,
"8 years": 8,
"7 years": 7,
"6 years": 6,
"5 years": 5,
"4 years": 4,
"3 years": 3,
"2 years": 2,
"1 year": 1,
"< 1 year": 0,
"n/a": 0
}
}
loans = loans.drop(["last_credit_pull_d", "addr_state", "title", "earliest_cr_line","pymnt_plan"], axis = 1)
loans['int_rate'] = loans['int_rate'].str.rstrip('%').astype("float")
loans['revol_util'] = loans['revol_util'].str.rstrip('%').astype("float")
loans = loans.replace(mapping_dict)
loans.head()
Dummy Variables¶
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans.head()
3) Making Predictions¶
print(loans.info())
Classification with Logistic Regression & Cross Validation¶
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression()
cols = loans.columns
train_cols = cols.drop("loan_status")
features = loans[train_cols]
target = loans["loan_status"]
# 3 folds (no shuffling, so the split is deterministic)
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print("True Positive Rate (recall): " + str(tpr))
print("False Positive Rate (fall-out): " + str(fpr))
We should optimize for:
- a high recall (true positive rate): we want plenty of loans we could invest in.
- a low fall-out (false positive rate): we don't want to lose money on bad loans; keeping it low minimizes the risk.
Imbalanced Classes: Penalizing the Classifier with Class Weight¶
We can do this by setting the class_weight parameter to "balanced". This tells scikit-learn to penalize the misclassification of the minority class during the training process; the penalty is set to be inversely proportional to the class frequencies.
For the classifier, correctly classifying a row where loan_status is 0 becomes roughly 6 times more important than correctly classifying a row where loan_status is 1.
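As a sanity check on that roughly 6-to-1 ratio, the "balanced" weights can be reproduced by hand with the formula n_samples / (n_classes * count(class)) — a small sketch reusing the target Series defined above:
import numpy as np

# "balanced" weight for each class: n_samples / (n_classes * number of samples in that class)
counts = np.bincount(target)
weights = len(target) / (2 * counts)
print({0: weights[0], 1: weights[1]})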
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
lr = LogisticRegression(class_weight = "balanced")
# 3 folds
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(tpr)
print(fpr)
We improved the false positive rate.
Manual Penalties: Increase the penalty for misclassifying bad loans¶
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
penalty = {
0: 10,
1: 1
}
lr = LogisticRegression(class_weight=penalty)
kf = KFold(n_splits=3)
predictions = cross_val_predict(lr, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(tpr)
print(fpr)
While we have fewer false positives, we are also missing opportunities to make more money.
There is a trade-off between the two rates.
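To make the trade-off concrete, we can sweep the penalty on class 0 and watch both rates move together — a rough sketch reusing the features, target, and kf objects from above (the weight values are arbitrary illustration choices):
for weight in [1, 2, 5, 10, 20]:
    lr = LogisticRegression(class_weight={0: weight, 1: 1})
    preds = pd.Series(cross_val_predict(lr, features, target, cv=kf))
    tp = ((preds == 1) & (target == 1)).sum()
    fn = ((preds == 0) & (target == 1)).sum()
    fp = ((preds == 1) & (target == 0)).sum()
    tn = ((preds == 0) & (target == 0)).sum()
    print(weight, "TPR:", tp / (tp + fn), "FPR:", fp / (fp + tn))
Raising the weight on class 0 should push the false positive rate down, at the cost of recall.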
Random Forests¶
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
rf = RandomForestClassifier(class_weight="balanced", random_state=1)
kf = KFold(n_splits=3)
predictions = cross_val_predict(rf, features, target, cv = kf)
predictions = pd.Series(predictions)
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(loans[tn_filter])
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(loans[tp_filter])
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(loans[fn_filter])
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(loans[fp_filter])
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print(tpr)
print(fpr)